261 research outputs found

Unified Representation of Molecules and Crystals for Machine Learning

Author: Huo Haoyan
Rupp Matthias
Publication date: 02/01/2018

Accurate simulations of atomistic systems from first principles are limited by computational cost. In high-throughput settings, machine learning can potentially reduce these costs significantly by accurately interpolating between reference calculations. For this, kernel learning approaches crucially require a single Hilbert space accommodating arbitrary atomistic systems. We introduce a many-body tensor representation that is invariant to translations, rotations and nuclear permutations of same elements, unique, differentiable, can represent molecules and crystals, and is fast to compute. Empirical evidence is presented for energy prediction errors below 1 kcal/mol for 7k organic molecules and 5 meV/atom for 11k elpasolite crystals. Applicability is demonstrated for phase diagrams of Pt-group/transition-metal binary systems.

arXiv.org e-Print Archive

KOPS - The Institutional Repository of the University of Konstanz

MPG.PuRe

Distance phenomena in high-dimensional chemical descriptor spaces : consequences for similarity-based approaches

Author: Rupp Matthias
Schneider Gisbert (Prof. Dr.)
Publication date: 09/06/2009

Hochschulschriftenserver - Universität Frankfurt am Main

Kern-basierte Lernverfahren für das virtuelle Screening [Kernel-based learning methods for virtual screening]

Author: Rupp Matthias
Publication date: 05/02/2010

We investigate the utility of modern kernel-based machine learning methods for ligand-based virtual screening. In particular, we introduce a new graph kernel based on iterative graph similarity and optimal assignments, apply kernel principal component analysis to projection error-based novelty detection, and discover a new selective agonist of the peroxisome proliferator-activated receptor gamma using Gaussian process regression. Virtual screening, the computational ranking of compounds with respect to a predicted property, is a cheminformatics problem relevant to the hit generation phase of drug development. Its ligand-based variant relies on the similarity principle, which states that (structurally) similar compounds tend to have similar properties. We describe the kernel-based machine learning approach to ligand-based virtual screening; in this, we stress the role of molecular representations, including the (dis)similarity measures defined on them, investigate effects in high-dimensional chemical descriptor spaces and their consequences for similarity-based approaches, review literature recommendations on retrospective virtual screening, and present an example workflow. Graph kernels are formal similarity measures that are defined directly on graphs, such as the annotated molecular structure graph, and correspond to inner products. We review graph kernels, in particular those based on random walks, subgraphs, and optimal vertex assignments. Combining the latter with an iterative graph similarity scheme, we develop the iterative similarity optimal assignment graph kernel, give an iterative algorithm for its computation, prove convergence of the algorithm and the uniqueness of the solution, and provide an upper bound on the number of iterations necessary to achieve a desired precision. In a retrospective virtual screening study, our kernel consistently improved performance over chemical descriptors as well as other optimal assignment graph kernels.
Chemical data sets often lie on manifolds of lower dimensionality than the embedding chemical descriptor space. Dimensionality reduction methods try to identify these manifolds, effectively providing descriptive models of the data. For spectral methods based on kernel principal component analysis, the projection error is a quantitative measure of how well new samples are described by such models. This can be used for the identification of compounds structurally dissimilar to the training samples, leading to projection error-based novelty detection for virtual screening using only positive samples. We provide proof of principle by using principal component analysis to learn the concept of fatty acids. The peroxisome proliferator-activated receptor (PPAR) is a nuclear transcription factor that regulates lipid and glucose metabolism, playing a crucial role in the development of type 2 diabetes and dyslipidemia. We establish a Gaussian process regression model for PPAR gamma agonists using a combination of chemical descriptors and the iterative similarity optimal assignment kernel via multiple kernel learning. Screening of a vendor library and subsequent testing of 15 selected compounds in a cell-based transactivation assay resulted in 4 active compounds. One compound, a natural product with cyclobutane scaffold, is a full selective PPAR gamma agonist (EC50 = 10 ± 0.2 µM, inactive on PPAR alpha and PPAR beta/delta at 10 µM). The study delivered a novel PPAR gamma agonist, de-orphanized a natural bioactive product, and hints at the natural product origins of pharmacophore patterns in synthetic ligands.
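The projection-error idea in this abstract can be illustrated with a minimal sketch. It assumes plain linear principal component analysis on toy two-dimensional data (the thesis works with kernel PCA on chemical descriptors); the function names, the data, and the single-component model are illustrative assumptions, not the author's implementation.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equally long vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def top_component(centered, iters=200):
    """Leading principal direction via power iteration on the covariance matrix.

    Assumes the start vector of all ones is not orthogonal to that direction.
    """
    d = len(centered[0])
    n = len(centered)
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    w = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def projection_error(x, mean, w):
    """Squared distance between x and its projection onto the 1-D PCA model."""
    c = [a - m for a, m in zip(x, mean)]
    proj = sum(a * b for a, b in zip(c, w))
    return sum(a * a for a in c) - proj * proj


# Toy "positive" samples lying close to the line y = 2x (hypothetical data).
train = [[0.0, 0.1], [1.0, 1.9], [2.0, 4.1], [3.0, 5.9], [4.0, 8.1]]
mean = mean_vector(train)
centered = [[a - m for a, m in zip(row, mean)] for row in train]
w = top_component(centered)

inlier_error = projection_error([5.0, 10.0], mean, w)   # follows the trend
outlier_error = projection_error([5.0, 0.0], mean, w)   # far from the model
```

A sample whose projection error is large relative to the training samples would be flagged as novel; the threshold is a modelling choice that the abstract does not specify.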

Hochschulschriftenserver - Universität Frankfurt am Main

Molecular similarity for machine learning in drug development : poster presentation

Author: Proschak Ewgenij
Rupp Matthias
Schneider Gisbert (Prof. Dr.)
Publication date: 26/03/2008

In pharmaceutical research and drug development, machine learning methods play an important role in virtual screening and ADME/Tox prediction. For the application of such methods, a formal measure of similarity between molecules is essential. Such a measure, in turn, depends on the underlying molecular representation. Input samples have traditionally been modeled as vectors. Consequently, molecules are represented to machine learning algorithms in a vectorized form using molecular descriptors. While this approach is straightforward, it has its shortcomings. Among others, the interpretation of the learned model can be difficult, e.g. when using fingerprints or hashing. Structured representations of the input constitute an alternative to vector-based representations, a trend in machine learning over the last years. For molecules, there is a rich choice of such representations. Popular examples include the molecular graph, molecular shape and the electrostatic field. We have developed a molecular similarity measure defined directly on the (annotated) molecular graph, a long-established topological model for molecules. It is based on the concepts of optimal atom assignments and iterative graph similarity. In the latter, two atoms are considered similar if their neighbors are similar. This recursive definition leads to a non-linear system of equations. We show how to iteratively solve these equations and give bounds on the computational complexity of the procedure. Advantages of our similarity measure include interpretability (atoms of two molecules are assigned to each other, each pair with a score expressing local similarity; this can be visualized to show similar regions of two molecules and the degree of their similarity) and the possibility to introduce knowledge about the target where available.
We retrospectively tested our similarity measure using support vector machines for virtual screening on several pharmaceutical and toxicological datasets, with encouraging results. Prospective studies are under way.
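The recursive definition "two atoms are similar if their neighbors are similar" can be sketched as a fixed-point iteration. The version below is a simplified illustration, not the authors' exact kernel: it assumes a 0/1 label match as the base atom similarity, ignores bond annotations, uses a hypothetical mixing weight `alpha`, runs a fixed number of sweeps, and finds the best neighbor assignment by brute force over permutations (practical only for small neighborhoods).

```python
from itertools import permutations


def best_neighbor_match(x, nbrs_a, nbrs_b):
    """Best one-to-one assignment score between two neighbor sets, in [0, 1]."""
    if not nbrs_a or not nbrs_b:
        return 0.0
    if len(nbrs_a) <= len(nbrs_b):
        small, large, flip = nbrs_a, nbrs_b, False
    else:
        small, large, flip = nbrs_b, nbrs_a, True
    best = 0.0
    for perm in permutations(large, len(small)):
        # x is always indexed as x[atom in A][atom in B].
        total = sum(x[q][p] if flip else x[p][q] for p, q in zip(small, perm))
        best = max(best, total)
    return best / len(large)


def iterative_similarity(labels_a, adj_a, labels_b, adj_b, alpha=0.5, iters=50):
    """Return x[i][j], the similarity of atom i in graph A and atom j in graph B."""
    base = [[1.0 if la == lb else 0.0 for lb in labels_b] for la in labels_a]
    x = [row[:] for row in base]
    for _ in range(iters):
        x = [[(1 - alpha) * base[i][j]
              + alpha * best_neighbor_match(x, adj_a[i], adj_b[j])
              for j in range(len(labels_b))]
             for i in range(len(labels_a))]
    return x


# Heavy-atom graph C-C-O compared with itself (toy example).
labels = ["C", "C", "O"]
adjacency = [[1], [0, 2], [1]]
sim = iterative_similarity(labels, adjacency, labels, adjacency)
```

In this toy case each atom is fully similar to itself, while the terminal carbon and the oxygen receive a partial score because they differ in element but share an equivalent neighborhood.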

Springer - Publisher Connector

PubMed Central

Hochschulschriftenserver - Universität Frankfurt am Main

Graph kernels for chemoinformatics – a critical discussion

Author: M Hartenfeller
Matthias Rupp
O Ivanciuc
T Hofmann
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Big Data meets Quantum Chemistry Approximations: The $\Delta$-Machine Learning Approach

Author: Dral Pavlo O.
Ramakrishnan Raghunathan
Rupp Matthias
von Lilienfeld O. Anatole
Publication date: 17/03/2015

Chemically accurate and comprehensive studies of the virtual space of all possible molecules are severely limited by the computational cost of quantum chemistry. We introduce a composite strategy that adds machine learning corrections to computationally inexpensive approximate legacy quantum methods. After training, highly accurate predictions of enthalpies, free energies, entropies, and electron correlation energies are possible, for significantly larger molecular sets than used for training. For thermochemical properties of up to 16k constitutional isomers of C$_7$H$_{10}$O$_2$ we present numerical evidence that chemical accuracy can be reached. We also predict electron correlation energy in post Hartree-Fock methods, at the computational cost of Hartree-Fock, and we establish a qualitative relationship between molecular entropy and electron correlation. The transferability of our approach is demonstrated, using semi-empirical quantum chemistry and machine learning models trained on 1 and 10% of 134k organic molecules, to reproduce enthalpies of all remaining molecules at density functional theory level of accuracy.
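The composite strategy (cheap baseline plus a learned correction) can be sketched on a one-dimensional toy problem. Everything concrete here is an assumption for illustration: `cheap_method` and `expensive_method` are hypothetical stand-ins for the inexpensive and the reference quantum methods, and kernel ridge regression with a Gaussian kernel is one possible choice of correction model.

```python
import math


def gaussian_kernel(a, b, sigma=0.5):
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def cheap_method(x):
    """Hypothetical inexpensive baseline."""
    return x * x


def expensive_method(x):
    """Hypothetical accurate reference."""
    return x * x + 0.3 * math.sin(2 * x)


def fit_delta_model(train_x, lam=1e-8):
    """Learn the difference reference - baseline; predict baseline + correction."""
    deltas = [expensive_method(x) - cheap_method(x) for x in train_x]
    gram = [[gaussian_kernel(a, b) + (lam if i == j else 0.0)
             for j, b in enumerate(train_x)]
            for i, a in enumerate(train_x)]
    coeffs = solve(gram, deltas)

    def predict(x):
        correction = sum(c * gaussian_kernel(x, t)
                         for c, t in zip(coeffs, train_x))
        return cheap_method(x) + correction

    return predict


train_x = [i * 0.5 for i in range(9)]  # reference calculations at 0.0 .. 4.0
predict = fit_delta_model(train_x)

x_new = 2.25  # not in the training set
baseline_error = abs(cheap_method(x_new) - expensive_method(x_new))
delta_error = abs(predict(x_new) - expensive_method(x_new))
```

The model only has to learn the difference between the two levels of theory, which in this toy setup is a smoother and smaller quantity than the target itself.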

arXiv.org e-Print Archive

Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning

Author: Müller Klaus-Robert
Rupp Matthias
Tkatchenko Alexandre
von Lilienfeld O. Anatole
Publication venue: American Physical Society (APS)
Publication date: 12/09/2011

We introduce a machine learning model to predict atomization energies of a diverse set of organic molecules, based on nuclear charges and atomic positions only. The problem of solving the molecular Schrödinger equation is mapped onto a non-linear statistical regression problem of reduced complexity. Regression models are trained on and compared to atomization energies computed with hybrid density-functional theory. Cross-validation over more than seven thousand small organic molecules yields a mean absolute error of ~10 kcal/mol. Applicability is demonstrated for the prediction of molecular atomization potential energy curves.
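The abstract does not spell out how nuclear charges and positions are turned into model input. A minimal sketch, assuming the Coulomb-matrix descriptor associated with this work (diagonal 0.5 Z^2.4, off-diagonal Z_i Z_j / |R_i - R_j|) and atomic units, could look as follows; the function name and the H2 example geometry are illustrative.

```python
import math


def coulomb_matrix(charges, positions):
    """Coulomb-matrix descriptor built from nuclear charges and positions.

    Diagonal: 0.5 * Z_i ** 2.4
    Off-diagonal: Z_i * Z_j / |R_i - R_j| (distances assumed to be in bohr)
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hydrogen molecule with a bond length of 1.4 bohr (illustrative geometry).
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```

The raw matrix depends on atom ordering; a regression model would additionally need an ordering convention or an order-invariant summary, which this sketch leaves out.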

arXiv.org e-Print Archive

edoc

Open Repository and Bibliography - Luxembourg

MPG.PuRe

Virtual screening for PPAR-gamma ligands using the ISOAK molecular graph kernel and Gaussian processes

Author: Hansen Katja
Müller Klaus-Robert
Rupp Matthias
Schneider Gisbert (Prof. Dr.)
Schroeter Timon
Publication date: 01/01/2009

For a virtual screening study, we introduce a combination of machine learning techniques, employing a graph kernel, Gaussian process regression and clustered cross-validation. The aim was to find ligands of peroxisome proliferator-activated receptor gamma (PPAR-γ). The receptors in the PPAR family belong to the steroid-thyroid-retinoid superfamily of nuclear receptors and act as transcription factors. They play a role in the regulation of lipid and glucose metabolism in vertebrates and are linked to various human processes and diseases. For this study, we used a dataset of 176 PPAR-γ agonists published by Ruecker et al. ..

Springer - Publisher Connector

Hochschulschriftenserver - Universität Frankfurt am Main

Multi-task learning for pKa prediction

Author: Hansen Katja
Rupp Matthias
Sanguinetti Guido
Skolidis Grigorios
Publication date: 18/06/2018

Many properties of a compound depend directly on the dissociation constants of its acidic and basic groups. Significant effort has been invested in computational models to predict these constants. For linear regression models, compounds are often divided into chemically motivated classes, with a separate model for each class. However, sometimes too few measurements are available for a class to build a reasonable model, e.g., when investigating a new compound series. If data for related classes are available, we show that multi-task learning can be used to improve predictions by utilizing data from these other classes. We investigate performance of linear Gaussian process regression models (single task, pooling, and multi-task models) in the low sample size regime, using a published data set (n=698, mostly monoprotic, in aqueous solution) divided beforehand into 15 classes. A multi-task regression model using the intrinsic model of co-regionalization and incomplete Cholesky decomposition performed best in 85% of all experiments. The presented approach can be applied to estimate other molecular properties where few measurements are available.
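The intrinsic model of co-regionalization couples tasks by multiplying an input kernel with a task-covariance entry. The sketch below shows only that kernel structure with a linear input kernel. The fixed 2x2 task covariance, the feature vectors, and the class assignments are made-up illustrations; in practice the task covariance would be estimated from data, and the abstract's incomplete Cholesky step is not shown.

```python
def linear_kernel(x, y):
    """Plain dot product between two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def icm_kernel(x, task_x, y, task_y, task_cov):
    """Intrinsic co-regionalization kernel: task covariance times input kernel."""
    return task_cov[task_x][task_y] * linear_kernel(x, y)


# Hypothetical covariance between two compound classes (tasks 0 and 1).
task_cov = [[1.0, 0.6],
            [0.6, 1.0]]

# (descriptor vector, class index) pairs; values are illustrative only.
samples = [([1.0, 2.0], 0), ([0.5, -1.0], 1), ([2.0, 0.0], 1)]

gram = [[icm_kernel(xa, ta, xb, tb, task_cov) for xb, tb in samples]
        for xa, ta in samples]
```

Setting the off-diagonal task covariance to 0 recovers independent single-task models, and setting it to 1 recovers pooling, which is how one kernel can express the three regimes the abstract compares.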

RERO DOC Digital Library

Transcranial magnetic stimulation for individual identification of the best electrode position for a motor imagery-based brain-computer interface

Author: Hänselmann Siegfried
Rupp Rüdiger
Schneiders Matthias
Weidner Norbert
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Background: For the translation of noninvasive motor imagery (MI)-based brain-computer interfaces (BCIs) from the lab environment to end users at their homes, their handling must be improved. As a key component, the number of electroencephalogram (EEG)-recording electrodes has to be kept at a minimum. However, due to inter-individual anatomical and physiological variations, reducing the number of electrodes bears the risk of electrode misplacement, which will directly translate into a limited BCI performance of end users. The aim of the study is to evaluate the use of focal transcranial magnetic stimulation (TMS) as an easy tool to individually optimize electrode positioning for an MI-based BCI. For this, the area of MI-induced mu-rhythm modulation was compared with the motor hand representation area with respect to their localization and to the control performance of an MI-based BCI. Methods: Focal TMS was applied to map the motor hand areas and a 48-channel high-resolution EEG was used to localize MI-induced mu-rhythm modulations in 11 able-bodied, right-handed subjects (5 male, age: 23–31). The online BCI performances of the study participants were assessed with a single next-neighbor Laplace channel consecutively placed over the motor hand area and over the area of the strongest mu-modulation. Results: For most subjects, a consistent deviation between the position of the mu-modulation center and the corresponding motor hand areas well above the localization error could be observed in mediolateral and to a lesser degree in anterior-posterior direction. On an individual level, the MI-induced mu-rhythm modulation was on average found 1.6 cm (standard deviation (SD) = 1.30 cm) lateral and 0.31 cm anterior (SD = 1.39 cm) to the motor hand area and enabled a significantly better online BCI performance than the motor hand areas.
Conclusion: On an individual level, a trend towards a consistent average spatial distance between motor hand area and mu-rhythm modulation center was found, indicating that TMS may be used as a simple tool for quick individual optimization of EEG-recording electrode positions of MI-based BCIs. The study results indicate that motor hand areas of the primary motor cortex determined by TMS are not the main generators of the cortical mu-rhythm.

Springer - Publisher Connector

Heidelberger Dokumentenserver

PubMed Central